Plotting with quanteda

library(quanteda)

At the moment, two quanteda objects, dfm and kwic, have custom plot methods: a dfm is plotted as a wordcloud, and a kwic as a lexical dispersion plot. Other plots of interest can be produced with standard R techniques.

1. Wordcloud

Plotting a dfm object will create a wordcloud using the wordcloud package.

inaugDfm <- dfm(
                inaugCorpus[1:10],
                ignoredFeatures = stopwords('english')
            ) # Create a dfm from a somewhat smaller corpus
## 
##    ... lowercasing
##    ... tokenizing
## 
##    ... indexing documents: 10 documents
##    ... indexing features: 3,346 feature types
##    ... created a 10 x 3347 sparse dfm
##    ... complete. 
## Elapsed time: 0.059 seconds.
suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
                 plot(inaugDfm)
)

You can also plot a “comparison cloud”, but this can only be done with eight or fewer documents:

firstDfm <- dfm(texts(inaugCorpus)[1:8])
## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 8 documents
##    ... indexing features: 2,668 feature types
##    ... created a 8 x 2669 sparse dfm
##    ... complete. 
## Elapsed time: 0.034 seconds.
suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
  plot(firstDfm, comparison = TRUE)
)

The plot method passes additional arguments through to the underlying call to wordcloud.

suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
  plot(inaugDfm,
       colors = c('red', 'yellow', 'pink', 'green', 'purple', 'orange', 'blue'))
)

2. Lexical dispersion plot

Plotting a kwic object produces a lexical dispersion plot, which allows us to visualize the occurrences of particular terms throughout the text.

plot(kwic(inaugCorpus, "american"))

You can also pass multiple kwic objects to plot to compare the dispersion of different terms:

plot(
     kwic(inaugCorpus, "american"),
     kwic(inaugCorpus, "people"),
     kwic(inaugCorpus, "communist")
)

If you’re only plotting a single document, but with multiple keywords, then the keywords are displayed one below the other rather than side-by-side.

mobydickCorpus <- corpus(mobydickText)

plot(
     kwic(mobydickCorpus, "whale"),
     kwic(mobydickCorpus, "ahab")
)

You might also have noticed that the x-axis scale is the absolute token index for single texts, and the relative token index when multiple texts are being compared. If you prefer, you can specify an absolute scale:

plot(
     kwic(inaugCorpus, "american"),
     kwic(inaugCorpus, "people"),
     kwic(inaugCorpus, "communist"),
     scale='absolute'
)

In this case the texts do not all have the same length, so token positions beyond the end of a shorter text are shaded in grey.

Modifying lexical dispersion plots

The object returned is a ggplot object, which can be modified using ggplot2:

library(ggplot2)
theme_set(theme_bw())
g <- plot(
     kwic(inaugCorpus, "american"),
     kwic(inaugCorpus, "people"),
     kwic(inaugCorpus, "communist")
)
g + aes(color = keyword) + scale_color_manual(values = c('blue', 'red', 'green'))
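Any other ggplot2 layer can be added in the same way. For example (a sketch assuming the g object created above), a title and a clearer x-axis label:

```r
# Add standard ggplot2 layers to the stored dispersion plot
g + ggtitle("Lexical dispersion in the inaugural corpus") +
  xlab("Relative token position")
```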

3. Frequency plots

You can plot the frequency of the top features in a dfm using topfeatures.

inaugFeatures <- topfeatures(inaugDfm, 100)

# Create a data.frame for ggplot
topDf <- data.frame(
  term = names(inaugFeatures),
  frequency = unname(inaugFeatures)
)

# Sort by reverse frequency order
topDf$term <- with(topDf, reorder(term, -frequency))

ggplot(topDf) + geom_point(aes(x = term, y = frequency)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

If you wanted to compare the frequency of a single term across different texts, you could plot the dfm matrix like this:

americanFreq <- data.frame(
  document = rownames(inaugDfm[, 'american']),
  frequency = unname(as.matrix(inaugDfm[, 'american']))
)

ggplot(americanFreq) + geom_point(aes(x = document, y = frequency)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The above plots are raw frequency plots. For relative frequency plots (word count divided by the length of the document), we can weight the document-feature matrix. To obtain the expected word frequency per 100 words, we multiply by 100.

relDfm <- weight(inaugDfm, type='relFreq') * 100
head(relDfm)
## Document-feature matrix of: 10 documents, 3,347 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens       of      the     senate      and
##   1789-Washington      0.06993007 4.965035 8.111888 0.06993007 3.356643
##   1793-Washington      0.00000000 8.148148 9.629630 0.00000000 1.481481
##   1797-Adams           0.12942192 6.039689 7.031924 0.04314064 5.608283
##   1801-Jefferson       0.11587486 6.025492 7.531866 0.00000000 4.692932
##   1805-Jefferson       0.00000000 4.662973 6.602031 0.00000000 4.293629
##   1809-Madison         0.08510638 5.872340 8.851064 0.00000000 3.659574
##                  features
## docs                  house
##   1789-Washington 0.1398601
##   1793-Washington 0.0000000
##   1797-Adams      0.0000000
##   1801-Jefferson  0.0000000
##   1805-Jefferson  0.0000000
##   1809-Madison    0.0000000
relFreq <- data.frame(
  document = rownames(relDfm[, 'american']),
  frequency = unname(as.matrix(relDfm[, 'american']))
)

ggplot(relFreq) + geom_point(aes(x = document, y = frequency)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))